12 

Introduction to Graphical 
Modelling 



Marco Scutari x and Korbinian Strimmer 2 
O , 

1 Genetics Institute, University College London (UCL), London, UK 

2 Institute for Medical Informatics, Statistics and Epidemiology 
i^ \ (IMISE), University of Leipzig, Germany 

OO 

The aim of this chapter is twofold. In the first part (Sections 112.11 112.21 and 112.31 1 

we will provide a brief overview of the mathematical and statistical foundations 

of graphical models, along with their fundamental properties, estimation and basic 

inference procedures. In particular we will develop Markov networks (also known as 

Markov random fields) and Bayesian networks, which are the subjects of most past 

and current literature on graphical models. In the second part (Section ri2.4l) we will 

review some applications of graphical models in systems biology. 
CO 
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12.1 Graphical Structures and Random Variables 

Graphical models are a class of statistical models which combine the rigour of a 
probabilistic approach with the intuitive representation of relationships given by 
l/*~) ' graphs. They are composed by two parts: 

o , 

^^ , 1 . a set X = {Xi , X2 , ■ ■ ■ , X p } of random variables describing the quantities of 

interest. The statistical distribution of X is called the global distribution of the 
data, while the components it factorises into are called local distributions. 

•i-H , 

K^ , 2. a graph Q — (V, E) in which each vertex v £ V, also called a node, is 

!_h ' associated with one of the random variables in X (they are usually referred 

to interchangeably). Edges e £ E, also called links, are used to express the 
dependence structure of the data (the set of dependence relationships among 
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the v ariables in X) with different semantics for und irected graphs (IDiestell 



2005) and directed acyclic graphs (Ba ng- Jensen and G utin 2009) 



The scope of this clas s of models and the versatility of its definition are well 



expressed bv lPearll ([1988) in his seminal work 'Probabilistic Reasoning in Intelligent 



Systems: Networks of Plausible Inference': 

Graph representations meet our earlier requirements of explicitness, 
saliency, and stability. The links in the graph permit us to express 
directly and quantitatively the dependence relationships, and the graph 
topology displays these relationships explicitly and preserves them, 
under any assignment of numerical parameters. 

The nature of the link outlined above between the depende nce structure of the data 
and its graphical representation is given again by Peary (119881) in terms of conditional 



independence (denoted with _Lp) and graphical separation (denoted with JL G ). 

Definition 12.1.1 A graph Q is a dependency map (or D-map) of the probabilistic 
dependence structure PoflLif there is a one-to-one correspondence between the 
random variables in X and the nodes V of Q, such that for all disjoint subsets A, 
B, C o/X we have 

AiL P B|C =^> A_L G B|C. (12.1) 

Similarly, Q is an independency map (or l-map) of P if 

AiL P B|C^= A_L G B|C. (12.2) 

Q is said to be a perfect map of ' P if it is both a D-map and an l-map, that is 

A_L P B|C^=^ AiL G B|C, (12.3) 

and in this case P is said to be isomorphic to Q. 

Note that this definition does not depend on a particular characterisation of 
graphical separation, and therefore on the type of graph used in the graphical model. 
In fact both Markov networks ( Whittaker 1990) and Bayesian networks (Pearl 1988), 
which are by far the two most common classes of graphical models treated in 
literature, are defined as minimal I-maps even though the former use undirected 
graphs an the latter use directed acyclic graphs. Minimality requires that, if the 
dependence structure P of X can be expressed by multiple graphs, we must use 
the one with the minimum number of edges; if any further edge is removed then 
the graph is no longer an l-map of P. Being an l-map guarantees that two disjoint 
sets of nodes A and B found to be separated by C in the graph (according to 
the characterisation of separation for that type of graph) correspond to independent 
sets of variables. However, this does not mean that every conditional independence 
relationship present in P is reflected in the graph; this is true only if the graph is also 
assumed to be a dependency map, making it a perfect map of P. 
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In Markov netw orks graphica l separation (which is called undirected separation 
or u-separation in lCastillo et al.l (1 19971) ) is easily defined due to the lack of direction 



of the links. 

Definition 12.1.2 If A., B and C are three disjoint subsets of nodes in an undirected 
graph Q, then C is said to separate A from B, denoted A JLg B | C, if every path 
between a node in A and a node in B contains at least one node in C. 

In Bayesian networks separation takes the na me of directe d separation (or 
d-separation) and is defined as follows (Korb and Nicholson 2009). 

Definition 12.1.3 If A, B and C are three disjoint subsets of nodes in a directed 
acyclic graph Q, then C is said to d-separate A from B, denoted A JLq B | C, if 
along every path between a node in A and a node in B there is a node v satisfying 
one of the following two conditions: 

1. v has converging edges (i.e. there are two edges pointing to vfrom the adjacent 
nodes in the path) and none ofv or its descendants (i.e. the nodes that can be 
reached from v) are in C. 

2. v is in C and does not have converging edges. 

A simple application of these definitions is illustrated in Figure [T2TI We can see 
that in the undirected graph on the top A and B are separated by C, because there 
is no edge between A and B and the path that connects them contains C; so we 
can conclude that A is independent from B given C according to Definition ll2.1.2l 
As for three directed acyclic graphs, which are called the converging, serial and 
diverging connections, we can see that only the last two satisfy the conditions stated 
in Definition ll2.1.3l In the converging connection C has two incoming edges (which 
violates the second condition) and is included in the set of nodes we are conditioning 
on (which violates the first condition). Therefore we can conclude that C does not 
d-separate A and B and that according to the definition of I-map we can not say that 
A is independent from B given C. 

A fundamental result descending from the definitions of separation and 
d-separation is the Markov property (or Markov condition), which defines 
the decomposition of the global distribution of the data into a set of local 
distributions. For Bayesian networks it is related to the chain rule of probability 



(IKorb and Nicholsonll2009l) : it takes the form 



P(X) = Y[ PPQ I n Xi ) for discrete data and (12.4) 



v 



/(X) = Y[ f{X t | n Xi ) for continuous data, (12.5) 



s=l 



so that each local distribution is associated with a single node Xi and depends 
only on the joint distribution of its parents Hx t ■ This decomposition holds for any 
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separation (undirected graphs) 




A _1LB|C 
P(A, B, C) = P(A | C) P(B | C) P(C) 



d-separation (directed acyclic graphs) 




A 





A^BIC 
P(A, B, C) = P(C | A, B) P(A) P(B) 

A JLB|C 
P(A,B,C) = 

= P(B|C)P(C|A)P(A) 
= P(A|C)P(B|C)P(C) 



Figure 12.1 Graphical separation, conditional independence and probability factorisation 
for some simple undirected and directed acyclic graphs. The undirected graph is a simple 
3 node chain, while the directed acyclic graphs are the converging, serial and diverging 
connections (collectively known as fundamental connections in the theory of Bayesian 
networks). 



Bayesian network, regardless of its graph structure. In Markov networks on the other 
hand local distributions are associated with the cliques (maximal subsets of nodes in 
which each element is adjacent to all the others) Ci, C2, ■ • ., C& present in the 
graph; so 



p(X) = IJiMCi) 

/(x)=n^(Q) 



for discrete data and 



for continuous data. 



(12.6) 
(12.7) 



The functi ons 1^1,1^2, ■■■ ,^k are called Gibbs ' potentials dPearll 119881) . factor 
potentials dCastillo et alJI 19971) or simply potentials, and are non-negative functions 
representing the relative mass of probability of each clique. They are proper 
probability or density functions only when the graph is decomposable or 
triangulated, that is when it contains no induced cycles other than triangles. With 
any other type of graph inference becomes very hard, if possible at all, because 
if)i,if>2, . . . , tpk have no direct sta tistical interpretation. Decomposable graphs are 
also called chordal (iDiestell 120 05) because any cycle of length at least four has a 
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chord (a link between two nodes in a cycle that is not contained in the cycle itself). 
In this case the global distribution factorises again according to the chain rule and 
can be written as 

Y\ k PfC ■) 
P(X)= lii = 1 v tJ for discrete data and (12.8) 

n- =1 p(so 

rr fe f(C) 
f(X) = iu = lJy l > for continuous data, (12.9) 

n- =1 /(so 

where Sj a re the nodes of Cj which are also part of any other clique up to Cj_i 
dPearllll988h . 



A trivial application of these factorisations is illustrated again in Figure 112.11 
The Markov network is composed by two cliques, C\ = {A, C} and Ci = {B, C}, 
separated by R\ = {C}. Therefore according to Equation ll2.6l we have 

P (X) = P iA p } ( ^ (i? ' C) = P (A | C) P (B | C) P (C) . (12.10) 

In the Bayesian networks we can see that the decomposition of the global distribution 
results in three local distributions, one for each node. Each local distribution is 
conditional on the set of parents of that particular node. For example, in the 
converging connection we have that n^ = {0}, Hb = {0} and lie* = {A, B}, so 
according to Equation ll2.4l the correct factorisation is 

P(X)=P(A)P(B)P(C|A,5). (12.11) 

On the other hand, in the serial connection we have that 11^ = {0}, Hb — {C} and 

n c = {A}, so 

P(X)=P(A)P(C\A)P(B\C). (12.12) 

The diverging connection can be shown to result in the same factorisation, even 
though the nodes have different sets of parents than in the serial connection. 

Another fundamental result descending from the link between graphical 
separation and probabilistic independence is the definition of the Markov blanket 
(IPearllll988h of a node Xi, the set that completely separates Xi from the rest of 
the graph. Generally speaking it is the set of nodes that includes all the knowledge 
needed to do inference on X,, from estimation to hypothesis testing to prediction, 
because all the other nodes are conditionally independent from Xi given its Markov 
blanket. In Markov networks the Markov blanket coincides with the neighbours of 
Xi (all the nodes that are connected to X- by an edge); in Bayesian networks it is the 
union of the children of Xi, its parents, and its children's other parents (see Figure 
1 12.2b . In both classes of models the usefulness of Markov blankets is limited by the 
sparseness of the network. If edges are few compared to the number of nodes the 
interpretation of each Markov blanket becomes a useful tool in understanding and 
predicting the behaviour of the data. 
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Bayesian network 



Markov network 




Children's other 
parents 



Markov blanket Neighbours 



Figure 12.2 The Markov blanket of the node A in a Bayesian network (on the left) and in 
the corresponding Markov network given by its moral graph (on the right). The two graphs 
express the same dependence structure, so the Markov blanket of A is the same. 



The two characterisations of graphical separation and of the Markov properties 
presented above do not appear to be closely related, to the point that these two classes 
of graphical models seem to be very different in construction and interpretation. 
There are indeed dependency models that have an u n direct ed perfect map but 
not a directed acyclic one, and vice versa (see iPearll y988), pages 126 - 127 
for a simple example of a dependency struct ure that cannot be represented as a 
Bayesian network). However, it can be shown (ICastillo et all 1 997t IPearlll 19881) that 
every dependency structure that can be expressed by a decomposable graph can be 
modelled both by a Markov network and a Bayesian network. This is clearly the 
case for the small networks shown in Figure [T2~2l as the undirected graph obtained 
from the Bayesian network by moralisation (connecting parents which share a 
common child) is decomposable. It can also be shown that every dependency model 
expressible by an undirected graph is also expressible by a directed acyclic graph, 
with the addition of some auxiliary nodes. These two results indicate that there is a 
significant overlap between Markov and Bayesian networks, and that in many cases 
both can be used to the same effect. 



12.2 Learning Graphical Models 

Fitting graphical models is called learning, a term borrowed from expert systems 
and artificial intelligence theory, and in general requires a two-step process. 

The first step consists in finding the graph structure that encodes the conditional 
independencies present in the data. Ideally it should coincide with the minimal 
I-map of the global distribution, or it should at least identify a distribution 
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as close as possible to the correct one in the probab ility space. This step is 
called network structure o r simply structure learning (lKoller and Friedman! l2009t 



Korb and Nicholsonll2009l) . and is similar in approaches and terminology to model 



selection procedures for classical statistical models. 

The second step is called parameter learning and, as the name suggests, deals 
with the estimation of the parameters of the global distribution. This task can be 
easily reduced to the estimation of the parameters of the local distributions because 
the network structure is known from the previous step. 

Both structure and parameter learning are often performed using a combination 
of numerical algorithms and prior knowledge on the data. Even though significant 
progress have been made on performance and scalability of learning algorithms, an 
effective use of prior knowledge and relevant theoretical results can still speed up 
the learning process severalfold and improve the accuracy of the resulting model. 
Such a boost has been used in the past to overcome the limitations on computational 
power, leading to the deve lopment of the so-calle d expert sy stems (for real-world 
examples see the MUNIN (Andreasse n'et al.lll989h . ALARM (Bein lich etal.lll989h 



and Hailfinder ( Abramson et al. 1996) networks); it can still be used today to tackle 
larger and larger problems and obtain reliable results. 

12.2.1 Structure learning 

Structure learning algorithms have seen a steady development over the past 
two decades thanks to the increased availability of computational power and 
the application of many results from probability, information and optimisation 
theory. Despite the (sometimes confusing) variety of theoretical backgrounds and 
terminology they can all be traced to only three approaches: constraint-based, score- 
based and hybrid. 

Constraint-based algorithms use statistical tests to learn conditional independence 
relationships (called constraints in this setting) from the data and assume that the 
graph underlying the probability distribution is a perfect map to determine the correct 
network structure. They have been developed originally for Bayesian networks, but 
have been recentl y applie d to Markov networks as well (see for example the Grow- 



Shrink algorithm (Bromberg et al. 2009; Margaritis 2003), which works with minor 



modifications in both case s). Their main limitations are th e lack of control of either 
the family-wise error rate (IDudoit and van der Laanl2007l) or the false discovery rate 



(Efron 2008) and the additional assumptions needed by the tests themselves, which 
are often asymptotic and with problematic regularity conditions. 

Score-based algorithms are closer to model selection techniques developed in 
classical statistics and information theory. Each candidate network is assigned a 
score reflecting its goodness of fit, which is then taken as an objective function to 
maximise. Since the number of both undirected graphs and directed acyclic graphs 



grows more than exponentially in the number of nodes dHararv an d Palmer 1973) 
an exhaustive search is not feasible in all but the most trivial cases. This has led to 
an extensive use of heuristic optimisation algorithms, from local search (starting 
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from an initial network an d changing one edge at a time) to genetic algorithms 



(Russell and Norvig 2009). Convergence to a global maximum however is not 
guaranteed, as they can get stuck into a local maximum because of the noise present 
in the data or a poor choice in the tuning parameters of the score function. 

Hybrid algorithms use both statistical tests and score functions, combining the 
previous tw o families of algorithms . The general approach is described for Bayesian 
networks in Friedman et al. ( 1999b), and has proved one of the top performers up to 
date in lTsamardinos et al . (2006). Conditional independence tests are used to learn at 
least part of the conditional independence relationships from the data, thus restricting 
the search space for a subsequent score-based search. The latter determines which 
edges are actually present in the graph and, in the case of Bayesian networks, their 
direction. 

All these structure learning algorithms operate under a set of common 
assumptions, which are similar for Bayesian and Markov networks: 



• 



there must be a one-to-one correspondence between the nodes of the graph 
and the random variables included in the model; this means in particular that 
there must not be multiple nodes which are functions of a single variable. 

there must be no unobserved (also called latent or hidden) variables that 
are parents of an observed node in a Bayesian network; otherwise only part 
of the dependency structure can be observed, and the model is likely to 
include spurious edges. Specific algorithms have been developed for this 
particular case, typically based on B ayesian posterior d istribution s or th e 



EM algorithm ( Dempster et alj 1 19771); see f or ex ample iFriedmanl dl997l) . 
Elidan and Friedmanl ( bo05l) and binder et alj dl997l) . 



• all the relationships between the variables in the network must be conditional 
independencies, because they are by definition the only ones that can be 
expressed by graphical models. 

• every combination of the possible values of the variables in X must represent 
a valid, observable (even if really unlikely) event. This assumption implies 
a strictly positive global distribution, which is needed to have uniquely 
determined Markov blankets and, therefore, a uniquely identifiable model. 
Constraint-based algorithms work even when this is not true, because the 
existence of a perfect map is also a sufficient condition for the uniqueness 
of the Markov blankets dPearllll988l) . 

Some additional assumptions are needed to properly define the global distribution of 
X and to have a tractable, closed-form decomposition: 

• observations must be stochastically independent. If some form of temporal 
or spatial dependence is present it must be specifically accounted for 
in the definition of the network, as in dynamic Bayesian networks 
(Koll er and Friedmanl 120091) . They will be covered in Section 112.4.41 and 
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Figure 12.3 Factorisation of the ASIA Bayesian network from lLauritzen and Spiegelhalten 
( 1988) into local distributions, each with his own conditional probability table. Each row 
contains the probabilities conditional on a particular configuration of parents. 
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if all the random variables in X are discrete or categorical both the global 
and the local distributions are assumed to be multinomial. This is by far 
the most common assumption in literature, at least for Bayesian n etworks, 
because of its strong tie s with the analysis of contingency tables dAgresti 
l2002tlBishop et al.ll2007i) and because it allows an easy representation of local 
distributions as conditional probability tables (see Figure [T2~3l >. 

if on the other hand all the variables in X are continuous the global distribution 
is usually assumed to follow a multivariate Gaussian distribution, and the 
local distributions are either univariate or multivariate Gaussian distributions. 
This assumption defines a subclass of graphical models called graphica l 
Gaussian models (GGMs), which overlaps both Markov (Whittaker 1990) 
and Bayesian networks (Neapolitan 2004). A classical example from Edwardsl 
(2000) is illustrated in Figure [HH 



• if both continuous and categorical variables are present in the data there 
are three possible choices: assuming a mixture or conditional Gaussian 
distribution (B0ttcher 2004; Edwards 2000), discretising continuous attributes 
(Friedman and Goldszmidt 1996) or using a nonparametric approach 



Bach and Jordan 2003). 



The form of the probability or density function chosen for the local distributions 




1.000 0.332 0.235 
0.332 1.000 0.327 
0.235 0.327 1.000 



1.000 0.451 0.364 
0.451 1.000 0.256 
0.364 0.256 1.000 



Figure 12.4 The MARKS Graphical Gaussian network from lEdwardsl ( 120001) and its 
decomposition into cliques, the latter characterised by their partial correlation matrices. 
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determines which score functions (for score-based algorithms) or conditional 
independence tests (for constraint-based algorithms) can be used by structure 
learning algorithms. Common choices for conditional independence tests are: 

• discrete data: Pearson's x 2 and the G 2 tests (IAgrestill2002t lEdwardsl l2000). 
either as asymptoti c or permutation tests. The G 2 test is actually a log- 
likelihood ratio test (Lehmann and Romano 2005) and is equivalent to mutual 
information tests dCover and Thomasll2006l) up to a constant. 

• continuous data: Student's i, Fisher's Z and the log-likelihood ratio test s 
based on partial correlation coefficients dLegendrduOOOt Neapolitan! 120041) . 
again either as asymptotic or permutation tests. The log-likelihood ratio test is 
equivalent to the corresponding mutual information test as before. 

Score functions commonly used in both cases are penalised likelih ood scores such 
as the Akaike and B ayesian Information criteria (AIC and BIC, see lAkaikd (119741) 
and lSchwara dl978l) respectively), posterior densities such as the Bayesi a n Diri chlet 
and Gaussian equivalent scor es (BDe and BGe, see iHeckerman et al.1 d 19951) and 
Geiger and Heckerm an ( 1994) respectively) and entropy-based measures such as the 
Minimum Description Length (MDL) by Rissanen (2007). 

The last important property of structure learning algorithms, one that sometimes 
is not explicitly stat ed, is their inabili ty to discriminate between score equivalent 
Bayesian networks dChickerind 1 1 995b - Such models have the same skeleton (the 
undirected graph resulting from ignoring the direction of every edge) and the same 
v-structures (another name for the converging connection illustrated in Figure fTOT ). 
and therefore they encode the same conditional independence relationships because 
every d-separation statement that is true for one of them also holds for all the others. 

This characterisation implies a partitioning of the space of the possible networks 
into a set of equivalence classes whose elements are all I-maps of the same 
probability distribution. The elements of each of those equivalence classes are 
indistinguishable from each other without additional information, such as a non- 
uniform prior distribution; their factorisations into local distributions are equivalent. 




Figure 12.5 Two score equivalent Bayesian networks (on the left and in the middle) and the 
partially directed graph representing the equivalence class they belong to (on the right). Note 
that the direction of the edge D — > E is set because its reverse E — > D would in troduce tw o 
additional v-structures in the graph; for this reason it is called a compelled edge (Pearl 1988). 
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Statistical tests and almost all score functions (which are in turn called score 
equivalent functions), including those detailed above, are likewise unable to choose 
one model over an equivalent one. This means that learning algorithms, which 
base their decisions on these very tests and scores, are only able to learn which 
equivalence class the minimal I-map of the dependence structure belongs to. They 
are usually not able to uniquely determine the direction of all the edges present in the 
network, which is then represented as a partially directed graph (see Figure [T2~5l l. 



12.2.2 Parameter learning 



Once the structure of the network has been learned from the data the task of 
estimating and updating the parameters of the global distribution is greatly simplified 
by the application of the Markov property. 

Local distributions in practise involve only a small number of variables; 
furthermore their dimension usually does not scale with the size of X (and is often 
assumed to be bounded by a constant when computing the computational complexity 
of algorithms), thus avoiding the so called curse of dimensionality . This means that 
each local distribution has a comparatively small number of parameters to estimate 
from the sample, and that estimates are more accurate due to the better ratio between 
the size of parameter space and the sample size. 

The number of parameters needed to uniquely identify the global distribution, 
which is the sum of the number of parameters of the local ones, is also reduced 
because the conditional independence relationships encoded in the network structure 
fix large parts of the parameter space. For example in graphical Gaussian models 
partial correlation coefficients involving (conditionally) independent variables are 
equal to zero by definition, and joint frequencies factorise into marginal ones in 
multinomial distributions. 

However, parameter estimation is still problematic in many situations. For 
example it is increasingly common to have sample size much smaller than the 
number of variables included in the model; this is typical of microarray data, 
which have a few tens or hundreds observations and thousands of genes. Such 
a situation, which is called "small n, large p", leads to estimates with high 
variability unless particular care is taken both in structure and parameter learning 



(Castelo and Roverato 2006; Hastie et al. 2009; Schafer and Strimmer 2005a). 



Dense networks, which have a high number of edges compared their nodes, 
represent another significant challenge. Exact inference quickly becomes unfeasible 
as the number of nodes increases, and even approximate procedures based on 
Monte Car lo simulations and b o otstra p resampling require l arge c omputational 
resources dKoller and Friedman! |2009| ; iKorb and Nicho lson 2009). Numerical 



problems stemming from floating point approximations dGoldbera Il991|) and 
approximate numerical algorithms (such as the ones used in matrix inversion and 
eigenvalue computation) should also be taken into account. 
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12.3 Inference on Graphical Models 

Inference procedures for graphical models focus mainly on eviden ce propagation 



and model validation, e ven though other asp ects such as robustness (Cozman 19971) 



and sensitivity analysis (iGomez-Villegas et a l. 2008) have been studied for specific 
settings. 

Evidence propagation (another term borrowed from expert systems literature) 
studies the impact of new evidence and beliefs on the parameters of t he model. For 



this reason it is also referred to as belief propagation or belief updating dCastillo et al 



1997c iPearll 11988b . and has a clear Bayesian interpretation in terms of posterior 
and conditional probabilities. The structure of the network is usually considered 
fixed, thus allowing a scalable and efficient updating of the model through its 
decomposition into local distributions. 

In practise new evidence is introduced by either altering the relevant parameters 
{soft evidence) or setting one or more variables to a fixed value (hard evidence). The 
former can be thought of as a model revision or parameter tuning process, while the 
latter is carried out by conditioning the behaviour of the network on the values of 
some nodes. The process of computing such conditiona l probabilities is also know n 



as conditional probability query on a set of query nodes JKoller and Friedm an 2009) 



and can be performed with a wide selection of exact and approximate inference 
algorithms. Two classical examples of exact algorithms are variable elimination 
(optionally applied to the clique tree form of the network) and Kim and Pearl's 
Message Passing algorithm. Approximate algorithms on the other hand rely on 
various forms of Monte Carlo sampling such as forward sampling (also called logic 
sampling for Bayesian networks), likelihood-weighted sampling and importance 
sampling. M arkov Chain Monte Carlo methods such as Gibbs sampling are also 
widely used jKorb and Nicholsonll2009l) . 

Model validation on the other hand deals with the assessment of the performance 
of a graphical model when dealing with new or existing data. Common measures 
are the goodness-of-fit scores cited in the Section 112.2.11 or any appropriate loss 
measure such as misclassification error (for discrete data) and the residual sum of 
squares (for continuous data). Their estimati on is usually carried out us i ng either a 
separ ate testing data set or cross validation (iKoller and Friedmanll2009t IPefia et al 



2005) to avoid negatively biased results. 

Another nontrivial probl em is to determ ine the confidence level for particular 
structural features. In Friedman et al.l dl999al) this is accomplished by learning a 
large number of Bayesian networks from bootstrap samples drawn from the original 



data se t and estimating the empirical frequency of the features of interest. IScutari 



(1201 II) has recently extended this approach to obtain some univariate measures 



of variability and perform some basic hypothesis testing. Both techniques can be 



applied to Markov networks with little to no change. lTian an d He (2009) on the other 
hand used a non-uniform prior distribution on the space of the possible structures to 
compute the exact marginal posterior distribution of the features of interest. 
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12.4 Application of Graphical Models in Systems Biology 

In systems biology graphical models are employed to describe and to identify 
interdependencies among genes and gene products, with the eventual aim to better 
understand the molecular mechanisms of the cell. In medical systems biology the 
specific focus lies on disease mechanisms mediated by changes in the network 
structure. For example, a general assumption in cancer genomics is that there are 
different active pathways in healthy compared to affected tissues. 

A common problem in the practical application of graphical models in systems 
biology is the high dimensionality of the data compared to the small sample size. In 
other words, there are a large number p of variables to be considered whereas the 
number of observations n is small due to ethical reasons and cost factors. Typically, 
the number of parameters in a graphical model grows with some power of the number 
of variables. Hence, if the number of genes is large, the parameters describing the 
graphical model (e.g. edge probabilities) quickly outnumber the data points. For this 
reason graphical modelling in systems biology almost always requires some form of 
regularised inference, such as Bayesian inference, penalised maximum likelihood or 
other shrinkage procedures. 



12.4.1 Correlation networks 

The simplest g raphical models used in systems biology are relevance networks 



(Butte et al.l |2000), which are also known in statistics as correlation graphs. 
Relevance networks are constructed by first estimating the correlation matrix for 
all p(p — l)/2 pairs of genes. Subsequently, the correlation matrix is thresholded at 
some prespecified level, say at |ry | < 0.8, so that weak correlations are set to zero. 
Finally, a graph is drawn in order to depict the remaining strong correlations. 

Technically, correlation graphs visualise the marginal (in)dependence structure 
of the data. Assuming the latter are normally distributed, a missing edge between 
two genes in a relevance network is indicative of marginal stochastic independence. 
Because of their simplicity, both in terms of interpretation as well as computation, 
correlation graphs are enormously popular, not only for analy sing gene expression 



profiles but also many other kinds of omics data (Steue 



or analys 
jr 2006). 



12.4.2 Covariance selection networks 

The simplest graphical model that considers c onditional rathe r than marginal 
dependencies is the covariance selection model (Demps terlll972l). al so known as 
concentration graph or graphical Gaussian model ( Whittakerl 1 1 9901) . In a GGM 
the graph structure is constructed in the same way as in a relevance network; 
the only difference is that the presence of an edge is determined by the value 
of the corresponding partial correlation (the correlation between any two genes 
once the linear effect all other p — 2 genes has been removed) instead of the 
marginal correlation used above. Partial correlations may be computed in a number 
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of ways, but the most direct approach is by inversion and subsequent standardisation 
of the correlation matrix (the inverse of the covariance matrix is often called a 
concentration matrix). Specifically, it can be shown that if an entry in the inverse 
correlation matrix is close to zero then the partial correlation between the two 
corresponding genes also vanishes. Thus, under the normal data assumption a 
missing edge in a GGM implies conditional independence. 

Partial correlation graphs derived from genomic data are often called gene 
association networks, to distinguish them from correlation-based relevance 
networks. Despite their mathematical simplicity, it is not trivial to learn GGMs 



from high-dimensional "small n, large p" genomic data (Schaf er and Strimmei 



2005c). There are two key problems. First, inferring a large-scale correlation (or 
covariance) matrix from relatively few data is an ill-posed problem that requires 
some sort of regularisation. Otherwise the correlation matrix is singular and therefore 
cannot be inverted to compute partial correlations. Second, an effective variable 
selection procedure is needed to determine which estimated partial correlations are 
not significant and which represent actual linear dependencies. Typically, GGM 
model selection involves assumptions concerning the sparsity of the actual biological 
network. 

The first applications of covariance selection models to genome data were either 
restricted to a small number of variables (Waddell and Kishino 2000), used as a 
prepro cessing step in cluster a nalysis to reduce the effective dimension of the 



model (IToh and Hor imoto 2002), or employed low order parti al correlat ions as an 



approximation to fully conditioned partial correlations dde la Fuente et al.1 12004). 
However, newer inference procedures for GGMs are directly applicable to high- 
dimensional data. A Bayesian regression-based approach to learn large-scale GGMs 
is given in Do bra et al.1 (120041) . ISchafer and Strimme r (2005b) introduced a large- 
scale model selection procedure for GGMs using false discovery rate multiple 



testing with an empirically estimated null model. Scha fer and Strimmerl ([2005a) 
also proposed a James-Stein-type shrinkage correlation estimator that is both 
computationally and statistically efficient even in larger dimensions, specifically for 
use in network inference. An example of a GGM reconstructed with this algorithm 
from E. coli data is shown in Figure [T2~6l Methods for estimating large-scale inverse 
correlation matrices using diffe r ent variants of penalis ed maximum likelihood ar e 
discussed by Li and Gui (2006), Baneriee efaD d2008l) and [Friedman et al.1 J2008). 



Most recently, Andrei and Kendziorski (2009) considered a modified GGM that 
allows the specification of in teractions (i.e. multiplicative dependencies) among 
genes, and IRramer et al.l (1200 9) conducted an extensive comparison of regularised 



estimation techniques for GGMs. 



12.4.3 Bayesian networks 

Both gene relevance and gene association networks are undirected graphs. In 
order to learn about directed conditional dependencies Bayesian network inference 
procedures have been developed for static (and later also for time course) microarray 
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Figure 12.6 Partial correlation graph i nferr ed from E. coli data using the algorithm 
described in lSchafer and Strimmen 12005m and lSchafer and Strim mer (2005a). Dotted edges 
indicate negative partial correlation. 
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data. 

The application of Bayesian n etworks to learn large -scale directed graphs from 
microarray data was pioneered bylFriedman et al.1 d2000l) . and has also been reviewed 
more recently in lFriedmanl (120041) . The high dimensionality of the model means that 
inference procedures are usually unable to identify a single best Bayesian network, 
settling instead on a set of equally well behaved models. In addition, as discussed 
in Section \\2. 2. 11 all Bayesian networks belonging to the same equivalence class 
have the same score and therefore cannot be distinguished on the basis of the 
probability distribution of the data. For this reason it is often important to incorporate 
prior biological knowledge into the inference process of a Bayesian network. A 
B ayesian approach based on t he use of informative prior distributions is described 
in 



Mukherjee and Speed (2008) 



The efficiency of Bayesian networks, GGMs and relevance networks in recovering 
bi ological regulatory networks have been studied in an extensive and realistic setup 



in 



Werhli et al.1 (120061) . Not surprisingly, the amount of information contained in gene 



expression and other high-dimensional data is often too low to allow for accurate 
reconstruction of all the details of a biological network. Nonetheless, both GGMs 
and Bayesian networks are able to elucidate some of the underlying structure. 



12.4.4 Dynamic Bayesian networks 

The extension of Bayesian networks to the analysis of time course data is provided 
by dynamic Bayesian networks, which explicitly account for time dependencies 
in their definition. The incorporation of temporal aspects is important for systems 
biology, as it allows to draw conclusions about causal relations. 

Dynamic Bayesian networks are often restricted to linear systems, with two 
special (yet still very general) models: the vector-autoregressive (VAR) model and 
state-space models. The main difference between the two is that the latter includes 
hidden variables that are useful for implicit dimensionality redu ction. 



T he VAR model was first applied to genomic data by iFuiita et al.l (2007) 
and lOpgen-Rhein and Strimmerl d2007bl) . A key problem of this kind of model 
is that it is very parameter-rich, and therefore it is hard to estimate efficiently 
and reliably. Ongen-Rhein and S trimmer (2007b) proposed a shrinkage approach 
whereas IFuiita et al.l (120071) employed lasso regression for sparse VAR modelling. 
A refinemen t of the latter approach based on the elastic net penalised regression is 
described in lShimam ura et al. (2009). In all VAR models the estimated coefficients 



can be interpreted in terms of Granger causality dOpgen-Rhein and Strimmei 
2007bl) . 

State-space models are an extension of the VAR model, and include lower- 
dimensional latent variables to facilitate inference. The dimension of the latent 
variab l es is usually chose n in the order of the ran k of the data matrix. IHusmeiei 
d2003l) . |Perrin et al.l (12.0031) . and lRangel et al.l d2004l) were the first to study genomic 
data with dynamic Bayesian networks and to propose inference procedures suitable 
for use with microarray data. Bayesian learning procedures are discussed in 
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JLahdesmaki and Shmulevichl 120081) . A general state-s pace framework that allows 



to mo d el non- sta t ionary time course data is given in iGrzegorczvk and Husmeier 



(2009). Ra u et al.l (120101) present an empirical Bayes approach to learning dynamical 



Bayesian networks and apply it to gene expression data. 



12.4.5 Other graphical models 

Bayesian networks are graphical models where all edges are directed, whereas 
GGMs represent undirected conditional dependencies in multivariate data. On the 
other hand, chain graphs can include directed as well as undirected dependencies 
in same graph. One heuristic approach to infe r an approximating chain graph from 
high-dimensional genomic data is described in lOpgen-Rhein and Strimmerl (I2007al) . 



For reasons of simplicity, and to further reduce the number of parameters to be 
estimated, many graphical models used in systems biology only describe linear 
dependencies (GGM, VAR, state space models). Attempts to relax such linearity 



2007) and copula-based approaches dKim et alj |2008). 



assumptions include entropy netwo rks (Hau sser and Strimmerl l2009t iMever et al. 



Finally, sometimes time-discrete models such as dynamic Bayesian networks 
are not appropriate to study the dy namics of molecular processes. In these 
cases stochastic differential equations (IWilkinsonl 12009) often represent a viable 



alternative. It is also important to keep in mind that, given the small sample size 
of omics data, the most complex graphical model is not necessarily the best choice 
for an analyst dWerhli et al.ll2006h . 
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